Abstract: The notion of meta-mining has appeared recently
and extends traditional meta-learning in two ways. First,
it does not learn meta-models that support only the
learning algorithm selection task but ones that support
the whole data-mining process. In addition, it abandons the
so-called black-box approach to algorithm description followed
in meta-learning. Now, in addition to the datasets, algorithms
and workflows also have descriptors. For the latter two
these descriptions are semantic, describing properties of the
algorithms such as cost functions, learning biases, etc. With
the availability of descriptors both for the datasets and the
data-mining workflows, the traditional modelling techniques
followed in meta-learning, typically based on classification and
regression algorithms, are no longer appropriate. Instead we
are faced with a problem the nature of which is much more
similar to the problems that appear in recommendation systems.
However, at the same time, the requirements of the meta-mining
tasks make the direct use of tools from recommender
systems rather inappropriate. The most important meta-mining
requirements are that suggestions should use only the dataset
and workflow descriptors, and that the cold-start problem, e.g.
providing workflow suggestions for new datasets, must be handled.

In this paper we take a different view on the meta-mining 
modelling problem and treat it as a recommender problem. In 
order to account for the meta-mining specificities we derive 
a novel metric-based-learning recommender approach. Our 
method learns two homogeneous metrics, one in the dataset 
and one in the workflow space, and a heterogeneous one in the 
dataset-workflow space. All learned metrics reflect similarities 
established from the dataset-workflow preference matrix. The 
latter is constructed from the performance results obtained by 
the application of workflows to datasets. We demonstrate our 
method on meta-mining over biological (microarray datasets) 
problems. The application of our method is not limited to the
meta-mining problem; its formulation is general enough that
it can be applied to problems with similar requirements.

Keywords: Meta-Mining; Meta-Learning; Hybrid Recommendation;
Metric-Learning

I. Introduction 

Meta-learning is learning to learn: in computer science, it 
is the application of machine learning techniques to meta- 
data describing past learning experience, typically applica- 
tions of learning algorithms to specific datasets, in order to 
derive meta-learning models that can support the selection 
of an appropriate algorithm for a new dataset [1], [2],
[3], [4]. The meta-learning models are usually classification
or regression models learned by standard classification and 
regression algorithms. Until very recently meta-learning
focused only on the learning part of the data mining
process, trying to model the behavior of different learning
algorithms, and treated the learning algorithms as
black boxes, making no effort to describe the concepts that
underlie them and their properties.

The authors of [5] made an effort to address these limitations
by extending the meta-learning process to the whole
data mining process, resulting in a more comprehensive task
which they called meta-mining. In addition they made use
of a data mining ontology in order to provide detailed
descriptions of data mining algorithms in terms of their
core components, underlying assumptions, cost functions,
optimization strategies, etc., as well as detailed descriptions
of data mining workflows, the latter composed of operators
implementing data mining algorithms. Even though
the introduction of data mining algorithm and workflow
descriptors was an important step, the authors made rather
poor use of them by modelling the meta-mining problem as
a classification problem, thus following the traditional
meta-learning modelling approach. In this classification problem
the meta-mining instances corresponded to data mining
experiments, i.e. applications of workflows or algorithms on
datasets, and they consisted of two types of features: features
that described the dataset and features that described the data
mining workflow. The class label was determined on the
basis of the performance result estimated by the application
of the workflow on the dataset and indicated whether
the workflow was appropriate for the dataset.

In this paper we take a different approach to the modelling
of the meta-learning and meta-mining tasks. We view
them as a matching problem between datasets on the one
hand and data mining algorithms or workflows on the other,
in which the matching criterion is the performance of the
latter when applied on the former. We will address three
different meta-mining tasks. Given a new dataset we want
to recommend or rank available algorithms or data mining
workflows in terms of their expected performance on the
specific dataset; we will call this task learning workflow
preferences. Symmetrically, given a new data
mining workflow or algorithm, we want to know for which datasets
it is most appropriate; we will call this task learning
dataset preferences. Finally, given a new dataset and a new
workflow or algorithm we want to be able to determine the
goodness of their match, i.e. the degree to which the latter
will have a good performance when applied to the former;
we will call this learning dataset-workflow preferences.
All of these should be determined without any
actual application of the new algorithms on the new datasets,
on the basis of some meta-mining model that is
learned from past mining experiences.

These types of problems are similar in nature to problems
that appear in recommender systems, where we have users
and items and we want to suggest additional items for a
given user based on the preferences of users with similar
preferences. In the meta-mining and meta-learning case the
matrix containing the preferences of users for items is
replaced by a performance-based matrix of datasets and
workflows or algorithms that indicates the performance of
the latter applied to the former. This performance-based
preference matrix will be one component of our meta-mining
data; in addition we will use dataset and workflow
or algorithm descriptors. The final meta-mining models will
only use the dataset and workflow descriptors to return the
preferences. In recommender systems there is a relevant
stream of work that makes use of descriptors of users and
items, similar to the descriptors of datasets and workflows,
which is called hybrid recommendation systems [6], [7], [8],
[9]. However there are also a number of differences between
the nature of the recommendation problem that we have in
meta-mining and the typical recommendation problems. In
the latter the preference matrix is often very sparse, of
high dimensionality, and can have hundreds of thousands
of rows/users. In contrast, the preference matrix in meta-mining
is rather dense and involves a few hundred datasets
and workflows. The features of datasets and workflows that
we use in the meta-mining problems are quite informative,
in contrast to the typical recommendation problems where
it is rather hard to get informative features, especially
for user descriptions. Finally, in the meta-mining
setting the cold-start problem is central, with the most
typical example being predicting the workflow preferences
for a new dataset. In recommendation problems, however,
partly due to the low information content of the features
describing users, but also due to the nature of the problem
itself, the main focus is on the completion of missing values
in the preference matrix based on historical ratings of items
by similar users.

In this paper we present a new metric-learning-based 
approach to hybrid recommendation for meta-mining, which 
learns to match dataset descriptors to workflow descriptors. 



More specifically we will learn three different metrics. One 
on the dataset descriptor space which will reflect the fact that 
similar datasets will have similar workflow preferences, as 
these are given by the performance-based preference matrix. 
One on the workflow descriptor space which will reflect 
the fact that similar workflows will have similar dataset 
preferences again as these are given by the performance- 
based preference matrix. And a last heterogeneous metric 
over the two spaces of dataset and workflow descriptors, 
which will directly give the similarity/appropriateness of a 
given dataset for a given workflow. We will use these learned 
metrics, alone or in combination, to address the three meta- 
mining tasks that we described in the previous paragraphs. 

To the best of our knowledge the metric learning approach
that we present is the first of its kind, not only
for meta-mining, but also for the general context of hybrid
recommendation problems. Even though it was developed to
address the specific requirements of the meta-mining setting
it is not specific to it, and it can be used in any kind
of recommendation system that has similar requirements,
i.e. preference-based matching of users to items that relies on
descriptions of both and must handle the cold-start problem.

The rest of the paper is organized as follows. In Section
II we define the meta-mining tasks. In Section III we
describe our metric-learning-based approach to the problem
of learning hybrid recommendations for meta-mining. In
Section IV we briefly present the characteristics (features)
that we use to describe the datasets and the workflows. In
Section V we give the experiments and the evaluation of
our approach. In Section VI we discuss the related work
and finally we conclude in Section VII.

II. Meta-Mining Tasks 

Before proceeding to the definition of the different meta-mining
tasks that we will address, let us introduce some necessary
notation. Let $\mathbf{x} = (x_1, \dots, x_d)^T \in \mathbb{R}^d$ be the description
of some dataset, and $\mathbf{X}$ an $n \times d$ dataset matrix the $i$th
row of which is given by the dataset description $\mathbf{x}_i^T$. Thus the $\mathbf{X}$
matrix is the set of datasets over which the meta-mining
will take place. In addition let $\mathbf{a} = (a_1, \dots, a_l)^T \in \mathbb{R}^l$ be
the description of some data mining workflow, and $\mathbf{A}$ an
$m \times l$ workflow matrix the $j$th row of which is the workflow
description $\mathbf{a}_j^T$, i.e. $\mathbf{A}$ is the data mining workflow matrix
over which the meta-mining will take place. Finally let $\mathbf{R}$
be an $n \times m$ matrix the $(i,j)$ entry of which depends on
some performance result obtained by the application of the
data mining workflow $\mathbf{a}_j$ on the dataset $\mathbf{x}_i$. We will use
the notation $\mathbf{r}_{\mathbf{x}_i}$ to denote the vector given by the $i$th row of
$\mathbf{R}$, which corresponds to the dataset $\mathbf{x}_i$ and contains
the performance measures obtained by the application of
the $m$ data mining workflows on $\mathbf{x}_i$, and the notation $\mathbf{r}_{\mathbf{a}_j}$
to denote the vector given by the $j$th column of $\mathbf{R}$, which
contains the performance results of the application of the
data mining workflow $\mathbf{a}_j$ on the $n$ datasets. Thus the $\mathbf{R}$ matrix
relates, based on performance, datasets with workflows and
can be seen as giving the appropriateness of workflows for
datasets and vice versa.

Since here we focus only on meta-mining for classification
problems, the performance measure that we use to fill $\mathbf{R}$
is based on classification accuracy, which we estimate by
ten-fold cross-validation. The accuracies achieved by different
workflows are not comparable over different datasets; what is
much more important in meta-learning and meta-mining is the
relative performance order of a set of data mining workflows or
algorithms on a given dataset, and this relative order can be
compared in a meaningful manner over different datasets. We derive
such a relative order in the following way. Given a pair of
classification data mining workflows $\mathbf{a}_j$ and $\mathbf{a}_k$ applied on
dataset $\mathbf{x}_i$, we test the statistical significance of the difference
in their accuracies using McNemar's test at a significance
level of 0.05. If one workflow is significantly
better than the other it is assigned a score of one and the
other a score of zero; in case of no significant difference both
are assigned a score of 0.5. For a given dataset $\mathbf{x}_i$ the score
of a workflow is the sum of the points it gets in all
its pairwise comparisons with the other $m - 1$ workflows. It
is this score that we use to populate the $\mathbf{R}$ matrix, i.e.
its $(i,j)$ entry is the score obtained by workflow $\mathbf{a}_j$ on
dataset $\mathbf{x}_i$; we will also use the notation $r_{\mathbf{x}_i,\mathbf{a}_j}$ to denote the
$(i,j)$ entry of $\mathbf{R}$.
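The scoring scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which variant of McNemar's test is used, so the sketch uses the exact (binomial) two-sided version, and the names `mcnemar_exact_p` and `workflow_scores` as well as the per-example correctness input are hypothetical.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # the two workflows never disagree
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def workflow_scores(correct, alpha=0.05):
    """Score the workflows on one dataset (one row of R).

    correct[j][e] is True when workflow j classified example e correctly
    (an assumed input format, not taken from the paper).
    """
    m = len(correct)
    scores = [0.0] * m
    for j in range(m):
        for k in range(j + 1, m):
            # b: j right and k wrong; c: k right and j wrong
            b = sum(1 for cj, ck in zip(correct[j], correct[k]) if cj and not ck)
            c = sum(1 for cj, ck in zip(correct[j], correct[k]) if ck and not cj)
            if mcnemar_exact_p(b, c) < alpha:
                scores[j if b > c else k] += 1.0  # significant winner takes the point
            else:
                scores[j] += 0.5  # no significant difference: split the point
                scores[k] += 0.5
    return scores


# Workflows 0 and 2 are always right, workflow 1 is always wrong.
example = [[True] * 20, [False] * 20, [True] * 20]
print(workflow_scores(example))  # [1.5, 0.0, 1.5]
```

Each pairwise comparison distributes exactly one point, so the scores on a dataset always sum to $m(m-1)/2$.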

Given the above we will now define three different meta-mining
tasks. In the first one, given a new unseen dataset $\mathbf{x}$,
i.e. a dataset with which we have not experimented,
we want to estimate the relative performance order of the $m$
data mining workflows. In other words we want to estimate
the relative workflow performance, or workflow preference,
vector $\mathbf{r}_{\mathbf{x}}$ for the dataset $\mathbf{x}$. We will call this task learning
workflow preferences. The second meta-mining task is the
symmetric of the first; here we want to estimate the appropriateness
of a new unseen workflow $\mathbf{a}$ for the $n$ datasets, i.e. we
want to estimate the dataset preference vector $\mathbf{r}_{\mathbf{a}}$ for the
workflow $\mathbf{a}$. We will call this task learning dataset preferences.
Finally, in the third and most difficult meta-mining
task we want to estimate the appropriateness of an unseen
workflow $\mathbf{a}$ for an unseen dataset $\mathbf{x}$, i.e. estimate the $r_{\mathbf{x},\mathbf{a}}$
value. We will call this meta-mining task learning
dataset-workflow preferences.

To address all three tasks we will rely on the use of
appropriate similarity measures. To learn workflow preferences
we need a dataset similarity measure that, given
a new dataset $\mathbf{x}$, establishes its most similar datasets in
the training set $\mathbf{X}$. From the workflow preference vectors of
these datasets we then estimate the workflow preference
vector $\mathbf{r}_{\mathbf{x}}$ of $\mathbf{x}$. In the same manner, to learn dataset preferences
we need a workflow similarity measure that, given a
new workflow $\mathbf{a}$, establishes its most similar workflows in
the training set $\mathbf{A}$. From the dataset preference vectors of
these workflows we then estimate the dataset preference
vector $\mathbf{r}_{\mathbf{a}}$ of $\mathbf{a}$. For the last task we rely on
a heterogeneous similarity measure that directly computes
similarities between workflows and datasets, and thus,
given an unseen dataset $\mathbf{x}$ and an unseen workflow $\mathbf{a}$,
produces the value $r_{\mathbf{x},\mathbf{a}}$ corresponding to the appropriateness of $\mathbf{a}$
for $\mathbf{x}$.

In the following section we will show how to learn 
appropriate metric matrices that we will use to compute the 
similarity measures that we briefly described in the previous 
paragraph. 

III. Learning Similarities for 
Hybrid-Recommendations in Meta-Mining 

Before starting to describe in detail how we will address
the three meta-mining tasks let us take a step back and give
a more abstract picture of the type of learning setting that we
want to address. We have two types of learning instances,
$\mathbf{x} \in \mathcal{X}$ and $\mathbf{a} \in \mathcal{A}$, and two training matrices $\mathbf{X}: n \times d$ and
$\mathbf{A}: m \times l$ respectively. Additionally we have an instance
alignment or preference matrix $\mathbf{R}: n \times m$, the $R_{ij}$ entry of
which gives some measure of appropriateness, preference,
or match of the $\mathbf{x}_i$ and $\mathbf{a}_j$ instances.

We can construct a similarity matrix for the instances
of $\mathbf{X}$ by exploiting the idea that similar instances of
$\mathbf{X}$ should have similar preferences with respect to the
instances of the $\mathbf{A}$ matrix. Here we no longer rely
on the original representation of the $\mathbf{x}$ instances in order to
define their similarities but on their preferences with respect
to the $\mathbf{a}$ instances¹. So the similarity matrix of the $\mathbf{x}$ instances
is the $\mathbf{R}\mathbf{R}^T$ matrix, the $[\mathbf{R}\mathbf{R}^T]_{ij}$ entry of which
gives the similarity of the $\mathbf{x}_i$ and $\mathbf{x}_j$ instances. In exactly the
same manner we can construct the similarity matrix for the
$\mathbf{a}$ instances as $\mathbf{R}^T\mathbf{R}$.

We now want to learn two Mahalanobis metrics, one
in the $\mathcal{X}$ and one in the $\mathcal{A}$ space, which will reflect the
instance similarities as these are given by the $\mathbf{R}\mathbf{R}^T$ and
$\mathbf{R}^T\mathbf{R}$ similarity matrices respectively. In addition we want
to learn a third metric over the two heterogeneous spaces
$\mathcal{X}$ and $\mathcal{A}$ which will reflect the similarity/preference of an
$\mathbf{x}_i \in \mathcal{X}$ instance to an $\mathbf{a}_j \in \mathcal{A}$ instance as this is given
by the $R_{ij}$ preference value. Since learning a Mahalanobis
metric is equivalent to learning a linear transformation, we
will see in the following paragraphs that what we actually
need to learn is two such linear transformations,
one for the $\mathcal{X}$ and one for the $\mathcal{A}$ space, which optimize
the three objective functions that we just sketched.

We should note here that the setting that we just described
is not specific to the meta-mining context but is also relevant
for any recommendation problem with similar requirements.
To the best of our knowledge the metric-based solution
which we present next is the first of its kind
for such settings.

¹ This reflects one of the basic assumptions in meta-learning: what
we are trying to capture is a similarity of datasets in terms of the relative
performance/appropriateness of different learning paradigms/algorithms for
them.

A. Learning a dataset metric 

We will now describe how to learn a Mahalanobis metric
matrix $\mathbf{W}_{\mathcal{X}}$ in the dataset space $\mathcal{X}$ in a manner that
reflects dataset similarity in terms of the similarity of their
workflow preference vectors. Instead of using the $\mathbf{R}\mathbf{R}^T$
matrix to establish the similarity of two datasets in terms of
their preference vectors, under which the dataset similarity is
simply the inner product of the workflow preference vectors,
we rely on the Pearson rank correlation coefficient of
these preference vectors. The latter is a more appropriate
measure of dataset similarity since it focuses on the relative
workflow performance, which is more relevant when one
wants to measure dataset similarity. Nevertheless, to simplify
notation we will continue using the $\mathbf{R}\mathbf{R}^T$ notation.
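A correlation-based dataset similarity matrix of this kind could be computed as in the following sketch. It assumes that "Pearson rank correlation" means Pearson's correlation applied to rank-transformed preference vectors (ties receiving their mean rank); that reading, and the helper names `average_ranks` and `rank_correlation_matrix`, are assumptions rather than details given in the text.

```python
import numpy as np


def average_ranks(v):
    """Ranks of v (1 = smallest), with tied values sharing their mean rank."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def rank_correlation_matrix(R):
    """n x n matrix of Pearson correlations between the rank-transformed rows of R."""
    ranked = np.vstack([average_ranks(row) for row in R])
    return np.corrcoef(ranked)


# Three datasets scored on three workflows (toy values).
R = np.array([[3.0, 2.0, 1.0],
              [6.0, 4.0, 2.0],
              [1.0, 2.0, 3.0]])
S = rank_correlation_matrix(R)  # plays the role of RR^T in the text
```

In this toy example the first two datasets order the workflows identically and the third orders them in reverse. A dataset whose preference vector is constant has undefined correlation, so such rows would need separate handling.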

We define the following metric learning optimization
problem:

$$\min_{\mathbf{W}_{\mathcal{X}}} F_1(\mathbf{W}_{\mathcal{X}}) = \|\mathbf{R}\mathbf{R}^T - \mathbf{X}\mathbf{W}_{\mathcal{X}}\mathbf{X}^T\|_F^2 + \mu_1 \operatorname{tr}(\mathbf{W}_{\mathcal{X}}) \quad \text{s.t. } \mathbf{W}_{\mathcal{X}} \succeq 0$$

where $\|\cdot\|_F$ is the Frobenius matrix norm, $\operatorname{tr}(\cdot)$ the matrix
trace, and $\mu_1 \geq 0$ is a parameter controlling the trade-off
between the empirical error and the metric complexity, used to
control overfitting. This is a convex optimization problem.
As already mentioned, learning a Mahalanobis metric matrix
is equivalent to learning a linear transformation of the
original feature space. Thus we can rewrite our metric
learning problem with the help of a linear transformation as:

$$\min_{\mathbf{U}} F_1(\mathbf{U}) = \|\mathbf{R}\mathbf{R}^T - \mathbf{X}\mathbf{U}\mathbf{U}^T\mathbf{X}^T\|_F^2 + \mu_1 \|\mathbf{U}\|_F^2 \tag{1}$$

where $\mathbf{W}_{\mathcal{X}} = \mathbf{U}\mathbf{U}^T$ is the $d \times d$ metric matrix, and $\mathbf{U}$ an
associated linear transformation with dimensionality $d \times t$
which projects the dataset description to a new space of
dimensionality $t$. Unlike the previous optimization problem
this one is no longer convex. We will work with optimization
problem (1) because it makes variable sharing easier
between the different optimization problems that we will
define. We solve it using gradient descent.
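As an illustration of solving problem (1) by gradient descent, the sketch below uses the gradient $-2\mathbf{X}^T(\mathbf{E} + \mathbf{E}^T)\mathbf{X}\mathbf{U} + 2\mu_1\mathbf{U}$ with $\mathbf{E} = \mathbf{R}\mathbf{R}^T - \mathbf{X}\mathbf{U}\mathbf{U}^T\mathbf{X}^T$. The text does not describe the step-size rule, initialization or stopping criterion, so the backtracking step, the small random initialization and all function names here are assumptions.

```python
import numpy as np


def objective_f1(U, X, S, mu):
    """F1(U) = ||S - X U U^T X^T||_F^2 + mu ||U||_F^2, with S the target similarity."""
    E = S - X @ U @ U.T @ X.T
    return np.sum(E ** 2) + mu * np.sum(U ** 2)


def gradient_f1(U, X, S, mu):
    """Gradient of F1 with respect to U."""
    E = S - X @ U @ U.T @ X.T
    return -2.0 * X.T @ (E + E.T) @ X @ U + 2.0 * mu * U


def learn_dataset_transform(X, S, t, mu=0.1, n_iter=200, seed=0):
    """Gradient descent with a simple backtracking step (an assumed choice)."""
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.standard_normal((X.shape[1], t))
    step = 1.0
    history = [objective_f1(U, X, S, mu)]
    for _ in range(n_iter):
        G = gradient_f1(U, X, S, mu)
        while step > 1e-12:
            U_new = U - step * G
            f_new = objective_f1(U_new, X, S, mu)
            if f_new <= history[-1]:
                break  # accept the first step that does not increase F1
            step *= 0.5
        else:
            break  # no acceptable step found: stop
        U, step = U_new, step * 2.0
        history.append(f_new)
    return U, history


rng = np.random.default_rng(1)
X = rng.standard_normal((8, 5))   # 8 toy datasets, 5 descriptors
R = rng.random((8, 6))            # toy preference matrix for 6 workflows
U, history = learn_dataset_transform(X, R @ R.T, t=3)
```

Because the problem is not convex, a run of this kind only reaches a local solution that depends on the initialization.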

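Using the learned metric, the similarity of two datasets $\mathbf{x}_i$
and $\mathbf{x}_j$ is $\omega_{\mathcal{X}}(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{U}\mathbf{U}^T\mathbf{x}_j$. Given some new dataset $\mathbf{x}$
we use this similarity to establish the set $\mathcal{N}_{\mathbf{x}}$ consisting
of the $N$ datasets that are most similar to $\mathbf{x}$ with respect to
the similarity of their relative workflow preferences, as this
is computed in the original feature space $\mathcal{X}$. With the help
of $\mathcal{N}_{\mathbf{x}}$ we can now compute the workflow preference vector
of $\mathbf{x}$ as the weighted average of the workflow preference
vectors of its nearest neighbors:

$$\mathbf{r}_{\mathbf{x}} = \zeta_{\mathbf{x}}^{-1} \sum_{\mathbf{x}_i \in \mathcal{N}_{\mathbf{x}}} \mathbf{r}_{\mathbf{x}_i}\, \omega_{\mathcal{X}}(\mathbf{x}, \mathbf{x}_i) \tag{2}$$

where $\zeta_{\mathbf{x}}$ is a normalization factor given by $\zeta_{\mathbf{x}} =
\sum_{\mathbf{x}_i \in \mathcal{N}_{\mathbf{x}}} \omega_{\mathcal{X}}(\mathbf{x}, \mathbf{x}_i)$. Thus using the learned metric we can
compute the workflow preference vector for a new dataset
by computing its similarity to the training datasets in the
$\mathcal{X}$ feature space, a similarity that was learned in a manner
that reflects dataset similarity in terms of relative
workflow preferences.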
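The neighbor-weighted estimate of equation (2) can be sketched as follows. The function name and the toy matrices are hypothetical, and the sketch assumes the similarities of the selected neighbors are positive so that the normalization factor is nonzero.

```python
import numpy as np


def predict_workflow_preferences(x, X, R, U, n_neighbors=5):
    """Weighted average of the preference rows of the N most similar datasets."""
    sims = X @ (U @ (U.T @ x))                 # similarity of x to each training dataset
    nearest = np.argsort(-sims)[:n_neighbors]  # indices of the N most similar datasets
    weights = sims[nearest]
    zeta = weights.sum()                       # normalization factor
    return weights @ R[nearest] / zeta


X = np.eye(3)                                  # three toy training datasets
R = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
U = np.eye(3)                                  # identity transform for illustration
x = np.array([1.0, 0.5, 0.0])
r_x = predict_workflow_preferences(x, X, R, U, n_neighbors=2)
```

Here the two nearest datasets have similarities 1 and 0.5, so the prediction is $(1 \cdot [2, 0] + 0.5 \cdot [0, 2]) / 1.5$. The same function applies to equation (4) by passing the workflow matrix, the transposed preference matrix and the workflow transformation instead.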

B. Learning a data mining workflow metric 

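To learn a Mahalanobis metric matrix $\mathbf{W}_{\mathcal{A}}$ in the data
mining workflow space $\mathcal{A}$ we proceed in exactly the
same manner as we did with the datasets, using now the
$\mathbf{R}^T\mathbf{R}$ matrix, the elements of which give the rank
correlation coefficients of the dataset preference vectors of
the workflows, thus measuring the similarity of workflows
in terms of their relative performance over the different
datasets. More precisely, as before we start with the metric
learning optimization problem:

$$\min_{\mathbf{W}_{\mathcal{A}}} F_2(\mathbf{W}_{\mathcal{A}}) = \|\mathbf{R}^T\mathbf{R} - \mathbf{A}\mathbf{W}_{\mathcal{A}}\mathbf{A}^T\|_F^2 + \mu_1 \operatorname{tr}(\mathbf{W}_{\mathcal{A}}) \quad \text{s.t. } \mathbf{W}_{\mathcal{A}} \succeq 0$$

which we cast to the problem of learning a linear transformation
$\mathbf{V}$ in the workflow space as:

$$\min_{\mathbf{V}} F_2(\mathbf{V}) = \|\mathbf{R}^T\mathbf{R} - \mathbf{A}\mathbf{V}\mathbf{V}^T\mathbf{A}^T\|_F^2 + \mu_1 \|\mathbf{V}\|_F^2 \tag{3}$$

where $\mathbf{W}_{\mathcal{A}} = \mathbf{V}\mathbf{V}^T$ is the $l \times l$ metric matrix, and $\mathbf{V}$ an
associated linear transformation with dimensionality $l \times t$
that projects workflow descriptions into a new space of
dimensionality $t$. As before this is not a convex optimization
problem. We solve it using gradient descent. Similar
to the dataset case, using the learned metric the similarity
of two workflows $\mathbf{a}_i$ and $\mathbf{a}_j$ is $\omega_{\mathcal{A}}(\mathbf{a}_i, \mathbf{a}_j) = \mathbf{a}_i^T\mathbf{V}\mathbf{V}^T\mathbf{a}_j$.
Given some new workflow $\mathbf{a}$, its workflow neighborhood $\mathcal{N}_{\mathbf{a}}$
consists of the $N$ workflows that are most similar to $\mathbf{a}$ with
respect to the similarity of their relative dataset preferences,
as this is computed in the original feature space $\mathcal{A}$. With
the help of $\mathcal{N}_{\mathbf{a}}$ we can now compute the dataset preference
vector of $\mathbf{a}$ as the weighted average of the dataset preference
vectors of its nearest neighbors:

$$\mathbf{r}_{\mathbf{a}} = \zeta_{\mathbf{a}}^{-1} \sum_{\mathbf{a}_j \in \mathcal{N}_{\mathbf{a}}} \mathbf{r}_{\mathbf{a}_j}\, \omega_{\mathcal{A}}(\mathbf{a}, \mathbf{a}_j) \tag{4}$$

where $\zeta_{\mathbf{a}}$ is a normalization factor given by $\zeta_{\mathbf{a}} =
\sum_{\mathbf{a}_j \in \mathcal{N}_{\mathbf{a}}} \omega_{\mathcal{A}}(\mathbf{a}, \mathbf{a}_j)$. Thus using the learned metric we can
compute the dataset preference vector $\mathbf{r}_{\mathbf{a}}$ for a new workflow
by computing its similarity to the training workflows in the $\mathcal{A}$
feature space, a similarity that was learned in a manner that
reflects workflow similarity in terms of relative
dataset preferences.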



C. Learning a heterogeneous metric over datasets and 
workflows 

The last metric that we want to learn is one that relates
datasets to data mining workflows, reflecting the
appropriateness/preference of a given workflow for a given dataset in
terms of the relative performance of the former applied to
the latter. We do so by starting with the following metric
learning optimization problem:

$$\min_{\mathbf{W}} F_3(\mathbf{W}) = \|\mathbf{R} - \mathbf{X}\mathbf{W}\mathbf{A}^T\|_F^2 + \mu_1 \operatorname{tr}(\mathbf{W}) \quad \text{s.t. } \mathbf{W} \succeq 0$$

which, if we parametrize the $d \times l$ metric matrix $\mathbf{W}$ with the
help of two linear transformation matrices $\mathbf{U}$ and $\mathbf{V}$ with
dimensions $d \times t$ and $l \times t$, can be rewritten as:

$$\min_{\mathbf{U},\mathbf{V}} F_3(\mathbf{U},\mathbf{V}) = \|\mathbf{R} - \mathbf{X}\mathbf{U}\mathbf{V}^T\mathbf{A}^T\|_F^2 + \mu_1 \|\mathbf{U}\|_F^2 + \mu_2 \|\mathbf{V}\|_F^2 \tag{5}$$

Essentially what we do here is to project the descriptions
of datasets and workflows to a common space with
dimensionality $t$ over which we compute their similarity in
a manner that reflects the preference matrix $\mathbf{R}$. We
set $t$ to $\min(\operatorname{rank}(\mathbf{A}), \operatorname{rank}(\mathbf{X}))$. In other words we
learn a heterogeneous metric which computes similarities of
datasets and workflows in terms of the relative performance
of the latter when applied on the former. Using the new
similarity metric we can now compute directly the match
between a dataset $\mathbf{x}$ and a workflow $\mathbf{a}$ as:

$$r_{\mathbf{x},\mathbf{a}} = \mathbf{x}^T\mathbf{U}\mathbf{V}^T\mathbf{a} \tag{6}$$

Clearly we can use this not only to determine the goodness 
of match between a dataset and a data mining workflow but 
also given some dataset and a set of workflows to order 
the latter according to their appropriateness with respect to 
the former, thus solving the meta-mining task 1, and vice 
versa given a workflow and a set of datasets to order the 
latter according to their appropriateness for the former thus 
solving meta-mining task 2. 
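A minimal sketch of how equation (6) yields both a single match score and a ranking of workflows for a dataset; the function names and the small matrices are made up for illustration, and $\mathbf{U}$, $\mathbf{V}$ are assumed to have been learned already.

```python
import numpy as np


def match_score(x, a, U, V):
    """Heterogeneous similarity x^T U V^T a between one dataset and one workflow."""
    return float(x @ U @ V.T @ a)


def rank_workflows(x, A, U, V):
    """Score every workflow (row of A) for dataset x and order them best-first."""
    scores = A @ (V @ (U.T @ x))
    return np.argsort(-scores), scores


U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # d = 3, t = 2
V = np.array([[1.0, 0.0], [0.0, 2.0]])               # l = 2, t = 2
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # m = 3 toy workflows
x = np.array([1.0, 1.0, 5.0])

order, scores = rank_workflows(x, A, U, V)
print(order, scores)  # [2 1 0] [1. 2. 3.]
```

Ranking datasets for a given workflow is symmetric: compute `X @ (U @ (V.T @ a))` and sort in the same way.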

In the objective function of the optimization problem (5)
we focus exclusively on trying to learn a metric that
reflects the appropriateness of some workflow for some dataset
as this is given by the entries of the $\mathbf{R}$ preference matrix.
However there is additional information that we can bring
in if we exploit the objective functions of the optimization
problems (1) and (3) and use them to additionally regularize
the objective function of (5). The overall idea here is that we
learn three different metrics in the spaces of datasets,
workflows, and datasets-workflows, all of them parametrized
by two linear transformations, in a manner that reflects the
basic meta-mining assumptions, namely that similar datasets
should have similar workflow preference vectors, similar
workflows should have similar dataset preference vectors,
and that the heterogeneous metric between datasets and
workflows should reflect the appropriateness of datasets for
workflows. By combining the three optimization problems
(1), (3), and (5) we get the following metric learning
optimization problem that achieves these goals:

$$\begin{aligned}
\min_{\mathbf{U},\mathbf{V}} F_4(\mathbf{U},\mathbf{V}) &= \alpha F_1(\mathbf{U}) + \beta F_2(\mathbf{V}) + \gamma F_3(\mathbf{U},\mathbf{V}) \\
&= \alpha \|\mathbf{R}\mathbf{R}^T - \mathbf{X}\mathbf{U}\mathbf{U}^T\mathbf{X}^T\|_F^2 \\
&\quad + \beta \|\mathbf{R}^T\mathbf{R} - \mathbf{A}\mathbf{V}\mathbf{V}^T\mathbf{A}^T\|_F^2 \\
&\quad + \gamma \|\mathbf{R} - \mathbf{X}\mathbf{U}\mathbf{V}^T\mathbf{A}^T\|_F^2
\end{aligned} \tag{7}$$

where $\alpha$, $\beta$, $\gamma$ are positive parameters that control the
importance of the three different optimization terms. As was
the case with optimization problem (5), this optimization
problem can also be used to address all three meta-mining
tasks. In fact (7) is the most general formulation of the
metric-learning-based hybrid recommendation problem and
includes the preceding problems, such as (1), as special cases.
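The combined objective can be evaluated together with its gradients as in the sketch below, which could then be plugged into any gradient-based optimizer. It follows the three-term expansion of $F_4$ without regularization terms; the function name, default weights and random test data are assumptions.

```python
import numpy as np


def f4_and_gradients(U, V, X, A, R, Sx, Sa, alpha=1.0, beta=1.0, gamma=1.0):
    """Value of the three-term objective and its gradients w.r.t. U and V.

    Sx and Sa are the dataset and workflow similarity targets
    (RR^T and R^T R in the text's notation).
    """
    E1 = Sx - X @ U @ U.T @ X.T   # dataset-metric residual
    E2 = Sa - A @ V @ V.T @ A.T   # workflow-metric residual
    E3 = R - X @ U @ V.T @ A.T    # heterogeneous-metric residual
    value = alpha * np.sum(E1 ** 2) + beta * np.sum(E2 ** 2) + gamma * np.sum(E3 ** 2)
    grad_U = -2 * alpha * X.T @ (E1 + E1.T) @ X @ U - 2 * gamma * X.T @ E3 @ A @ V
    grad_V = -2 * beta * A.T @ (E2 + E2.T) @ A @ V - 2 * gamma * A.T @ E3.T @ X @ U
    return value, grad_U, grad_V


rng = np.random.default_rng(0)
n, m, d, l, t = 6, 5, 4, 3, 2
X, A = rng.standard_normal((n, d)), rng.standard_normal((m, l))
R = rng.random((n, m))
U, V = rng.standard_normal((d, t)), rng.standard_normal((l, t))
value, grad_U, grad_V = f4_and_gradients(U, V, X, A, R, R @ R.T, R.T @ R)
```

Setting two of the three weights to zero leaves a single residual term, which is how the individual problems are recovered up to their regularizers. Both transformations are shared between the terms, so the heterogeneous term couples the dataset and workflow metrics.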

Matrix factorization, often used in recommender systems,
also learns a decomposition of a matrix into component
matrices $\mathbf{U}$ and $\mathbf{V}$ under different constraints. However,
by its very nature it cannot handle the out-of-sample
problem well. The objective function of problem (7) uses as
additional constraints the objective functions of (1) and (3)
and learns a common space for the datasets and workflows,
which is induced by the projection matrices $\mathbf{U}$ and $\mathbf{V}$. As
a result, the out-of-sample problem, i.e. the cold-start problem
in recommender systems, is naturally handled by
optimization problem (7).

IV. Dataset and workflow descriptors 

In the following two sections we will describe the dataset 
and workflow descriptors that we will use in our meta- 
mining experiments. 

A. Dataset Descriptors 

Originally proposed by the STATLOG project [10], the
idea of characterizing datasets has been mainstream in
meta-learning during the last decades [11], [12], [13], [14].
Various characterizations have subsequently been proposed,
from which we have selected the most relevant ones,
summarized as follows:

statistical measures: number of instances, number of classes,
proportion of missing values, proportion of continuous /
categorical features, noise signal ratio.

information-theoretic measures: class entropy, mutual
information.

geometrical and topological measures [15]: non-linearity,
volume of overlap region, maximum Fisher's discriminant
ratio, fraction of instances on the class boundary, ratio of average
intra/inter class nearest neighbour distance.

model-based measures: error rates and pairwise $1 - p$ values
obtained by landmarkers [16] such as ZeroR, one-nearest-neighbor,
Naive Bayes, Decision Stumps [17], Random
Trees [18], and the linear SVM [19], and the distributions
of the weights learned by the Relief [20] and SVMRFE [21]
feature selection algorithms.

Figure 1. Two workflow patterns with cross-level concepts. Thin edges
depict workflow decomposition; double lines depict DMOP's concept
subsumption.

Overall, we use a large spectrum of dataset characteristics,
from very simple ones such as the number of instances to
more elaborate ones such as the model-based measures,
giving a total of d ~ 150 dataset characteristics.

B. Workflow descriptors 

The ability to describe data mining algorithms and workflows
and to use these descriptors for meta-learning and
meta-mining is a very recent development [5]. There the authors
used DMOP, a data mining ontology, to describe learning
algorithms and data-processing algorithms such as feature
selection, discretization and normalization, with respect to
the mathematical concepts they implement and different
properties, such as their bias/variance profile, their sensitivity
to the type of attributes, their learning strategy, etc. In
addition the same ontology makes it possible to annotate operators
(algorithm implementations) of data mining workflows with their
respective concepts. A data mining workflow is typically a
directed acyclic graph of data mining operators.

In order to describe the data mining workflows we follow
the propositionalization approach used in [5]. We derive
from the annotated directed acyclic graphs that describe the
data mining workflows a set of frequent closed workflow
patterns using the tree-structured apriori algorithm of [22].
The description of a workflow is then given by a binary
vector that indicates the presence or absence of each of
the frequent patterns; the final workflow description contains
$l = 214$ features. In Figure 1 we give two examples of workflow
patterns that have been abstracted from ground feature
selection + classification workflows based on DMOP's
algorithm hierarchy. These patterns help us understand how
the workflow space is structured by describing frequent
workflow structures using the DMOP concepts.

V. Experiments 

In this section we perform a systematic evaluation
of the performance of the different metric learning
optimization problems for meta-mining that we presented
in the previous sections. More precisely we evaluate
the performance of the dataset metric learning optimization
problem given in (1) on the meta-mining task of learning
workflow preferences for a given dataset; the performance
of the workflow metric learning optimization problem of
(3) on the meta-mining task of learning dataset preferences;
and finally the performance of the two metric learning
optimization problems (5) and (7) on all three meta-mining
tasks.

A. Base-level Experiments 

In order to meta-mine we first need to perform a set of 
base-level experiments over which we will construct our 
meta-mining models. To do so we used 65 real world cancer 
microarray datasets, most of them were taken from the Na- 
tional Center for Biotechnology Information 0. Microarray 
datasets are characterized by a high-dimensionality and a 
small sample size, and a relatively low number of classes, 
most often two. These datasets have an average of 79.26 
instances, 15268.57 attributes, and 2.33 classes. On these 
datasets we applied a total of 35 classification data mining 
workflows; 28 of them were workflows that contained one 
feature selection and one classification algorithm, while 
the seven remaining ones had only a single classification 
algorithm. We used the four following feature selection algo- 
rithms: Information Gain, IG, Chi-square, CHI, ReliefF 1201 . 
RF, and recursive feature elimination with SVM l2TI . SVM- 
RFE, and fixed the number of selected features to ten. 
For classification we used the following seven algorithms:
one-nearest-neighbor, 1NN, the C4.5 and CART
decision tree algorithms, a Naive Bayes algorithm with
normal probability estimation, NBN, a logistic regression
algorithm, LR, and SVM with the linear, SVMl, and
the rbf, SVMr, kernels. We used the implementations of
these algorithms provided by the RapidMiner data mining
suite with their default parameters. Overall we had a total
of 65 x (28 + 7) = 2275 base-level DM experiments, i.e.
applications of these workflows to the datasets. To construct
the preference matrix R we estimated the performance of
the workflows using 10-fold cross-validation and applied
the McNemar-based scoring schema described in
Section II. In Table I we give, for each of the top ten
workflows over the full set of 65 datasets, the number of
times that it was ranked in the top five positions.
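
As a quick check of the experiment count, the workflow set described above can be enumerated as in the following sketch. It is only illustrative: the `+` naming of two-step workflows is a hypothetical convention introduced here, and nothing in it runs RapidMiner.

```python
from itertools import product

# Abbreviations as used in the text.
feature_selectors = ["IG", "CHI", "RF", "SVM-RFE"]
classifiers = ["1NN", "C4.5", "CART", "NBN", "LR", "SVMl", "SVMr"]

# 28 two-step workflows: one feature selection step followed by one classifier.
# The "FS+classifier" label is a hypothetical naming scheme.
two_step = [f"{fs}+{clf}" for fs, clf in product(feature_selectors, classifiers)]

# 7 single-step workflows: a classifier applied without feature selection.
one_step = list(classifiers)

workflows = two_step + one_step
n_datasets = 65
n_experiments = n_datasets * len(workflows)

print(len(two_step), len(one_step), len(workflows), n_experiments)
```

The four counts printed are the 28 and 7 workflows of each kind, their total of 35, and the 65 x 35 base-level experiments.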

B. Baseline Strategies and Evaluation Methodologies 

In order to assess how well the different variants perform 
we need to compare them with some default and baseline 
strategies. For the meta-mining task of workflow preference 
learning, we will use as the default strategy the preference 
vector given by the average of the workflow preference vectors
over the different training datasets for a given testing dataset. 
We should note that this is a rather difficult baseline to 

^1 http://www.ncbi.nlm.nih.gov/
Table I

Default top-10 workflows and their frequency in the top-5 positions.



Learning workflow preferences

              ρ                t5p              mae
def           0.332            77.8             4.50
EC            0.356            77.8             4.39
  δ           32/65 p=1        32/65 p=1        37/65 p=0.321
F1            0.366            78.6             4.83
  δ           34/65 p=0.804    33/65 p=1        40/65 p=0.082
  δEC         35/65 p=0.620    33/65 p=1        20/65 p=0.003
F3            0.286            77.1             5.64
  δ           23/65 p=0.025    23/65 p=0.025    19/65 p=0.001
  δEC         19/65 p=0.001    27/65 p=0.215    14/65 p=1e-6
F4            [...]            79.3             [...]
  δ           40/65 p=0.082    [...]            [...]
  δEC         42/65 p=0.025    42/65 p=0.025    [...]

Learning dataset preferences

[Sub-table largely lost in extraction. Surviving entries: NA; 1070/2275 p=0.005; δEC: 24/35 p=0.041 and 18/35 p=1. Entries shown as [...] above are likewise missing.]

Table II

Evaluation results. δ and δEC denote comparison results with the default (def) and the Euclidean baseline strategy (EC), respectively. ρ is the Spearman's rank correlation coefficient, in t5p we give the average accuracy of the top five workflows proposed by each strategy, and mae is the mean average error. X/Y indicates the number of times X, out of the Y experiments overall, that a method was better than the default or the baseline strategy.



beat, since the different workflows will be ranked according
to their average performance on the training datasets, with
workflows that perform consistently well ranked at the top.
For the second task, providing a dataset preference vector
for a given testing workflow, we have a similar default
strategy: we will use the average of the dataset preference
vectors over the different training workflows. However, this
strategy leads to a trivial constant vector
of dataset preferences, because the total sum of
workflow points for a given dataset is fixed to m(m - 1)/2
when we compare m workflows, by the very nature of
the workflow ranking schema for a given dataset. Finally, for
the last meta-mining task we will use as the default strategy
for predicting the appropriateness of a workflow for a
dataset the average of the values of the preference matrix
over the training set. We will denote the default strategy used in
the three meta-mining tasks by def. In addition we will also
have as a baseline strategy the recommendations obtained
when we use a simple Euclidean distance, i.e. all attributes
are treated equally and there is no learning, which we will
denote by EC. However, this baseline is only applicable to the
first two meta-mining tasks, learning workflow preferences
and learning dataset preferences, since it cannot be applied
to the kind of heterogeneous similarity problem that we have
in the third meta-mining task.
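
The fixed point total behind the constant default vector can be illustrated with a small sketch. The scoring rule used here (one point per pairwise win, half a point each on a tie, with a simple accuracy margin standing in for the significance test) is an assumption made for illustration and is not the exact schema of the paper; the accuracies are toy values.

```python
from itertools import combinations

def pairwise_points(accuracies, tie_margin=0.02):
    """Score m workflows on one dataset by all pairwise comparisons.

    Hypothetical rule: 1 point to the winner of each pair, 0.5 points to
    each workflow when their difference is within tie_margin (a stand-in
    for a non-significant statistical test).
    """
    points = [0.0] * len(accuracies)
    for a, b in combinations(range(len(accuracies)), 2):
        diff = accuracies[a] - accuracies[b]
        if abs(diff) <= tie_margin:
            points[a] += 0.5
            points[b] += 0.5
        elif diff > 0:
            points[a] += 1.0
        else:
            points[b] += 1.0
    return points

# Toy accuracies: 3 datasets (rows) x 4 workflows (columns).
accuracies = [
    [0.91, 0.85, 0.85, 0.70],
    [0.60, 0.75, 0.74, 0.90],
    [0.80, 0.80, 0.80, 0.80],
]
R = [pairwise_points(row) for row in accuracies]
m = len(accuracies[0])

# Each pair hands out exactly one point, so every row sums to m(m - 1)/2.
row_sums = [sum(row) for row in R]
# Averaging the dataset preference vectors over all m workflows therefore
# gives the same value, (m - 1)/2, for every dataset.
default_dataset_prefs = [sum(row) / m for row in R]

print(row_sums)
print(default_dataset_prefs)
```

Under this rule the row sums do not depend on which workflow wins, which is why averaging over workflows carries no information about the datasets.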

As resampling techniques we will use leave-one-dataset-out
to estimate the performance on the workflow preference
learning task, leave-one-workflow-out for the dataset
preference learning task, and leave-one-dataset-and-one-workflow-out
for the third task of predicting the appropriateness of a
workflow for a dataset.

To quantify the performance we will use a number of
evaluation measures. For the first two meta-mining tasks
we will report the average Spearman's rank correlation
coefficient between the predicted preference vector and the
real preference vector over the testing instances. We will
denote this average by ρ. This measure indicates the
degree to which the different methods correctly predict
the preference order. Note that this quantity is not
computable for the default strategy in the case of the learning
dataset preferences task, because the dataset
preference vector that it produces is constant, as we explained
previously, and the Spearman rank correlation coefficient is
undefined when one of the two vectors is constant. In
addition to the Spearman rank correlation coefficient, for the
meta-mining task of learning workflow preferences we will
also report the average accuracy of the top five workflows
suggested by each method, a measure which we will denote
by t5p. Finally, for the three meta-mining tasks we will
also report the mean average error, mae, over the respective
testing instances, between the predicted values of r_x, r_a, and
r_{x,a}, for learning workflow preferences, dataset preferences,
and dataset-workflow preferences, respectively, and the true
values. For each measure, method, and meta-mining task,
we will give the number of times that the method was better
than the respective default and baseline strategies over the
total number of datasets, workflows, or dataset-workflow
pairs (depending on the meta-mining task), as well as the
statistical significance of the result under a binomial test
with a significance level of 0.05. The comparison
results with the default strategy will be denoted by δ and
the comparison with the Euclidean baseline by δEC.
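
The three measures can be sketched for a single test instance as follows. This is a minimal illustration under stated assumptions: mae is implemented as the mean absolute difference between predicted and true values, t5p as the mean true accuracy of the k highest-ranked workflows, and all function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(pred, true):
    """Pearson correlation of the ranks; None if either vector is constant."""
    rp, rt = average_ranks(pred), average_ranks(true)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    if var_p == 0 or var_t == 0:
        return None  # undefined: a constant vector has no rank variance
    return cov / math.sqrt(var_p * var_t)

def mean_abs_error(pred, true):
    """Mean absolute difference between predicted and true values."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def top_k_accuracy(pred, accuracies, k=5):
    """Mean true accuracy of the k workflows ranked highest by pred."""
    top = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)[:k]
    return sum(accuracies[i] for i in top) / k

# Toy preference vectors over six workflows for one test dataset.
true = [5.0, 3.0, 4.0, 1.0, 2.0, 0.0]
pred = [4.5, 3.5, 3.0, 1.0, 2.5, 0.5]
accuracies = [0.90, 0.80, 0.85, 0.60, 0.70, 0.50]

print(spearman(pred, true))
print(spearman([1.5] * 6, true))   # constant prediction: undefined
print(mean_abs_error(pred, true))
print(top_k_accuracy(pred, accuracies, k=2))
```

The second call shows why ρ cannot be reported for a strategy that always outputs a constant preference vector.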

C. Experiment Results on the Biological Datasets 

We will now take a close look at the experimental results
for the different meta-mining tasks and the objective functions
that we have presented to address them. The full results are
given in Table II.

Learning Workflow Preferences: Learning algorithm
preferences is the most popular formulation in the traditional
stream of meta-learning. There, given a dataset description,
we seek to identify the algorithm that will most probably
deliver the best results for the given dataset. In that sense
this meta-mining task is the most similar to the typical
meta-learning task. We have presented three different objective
functions that can be used to address this problem. F1
makes use of only the dataset
descriptors and learns a similarity measure in that space
that best approximates their similarity with respect to their
relative workflow preference vectors. In traditional
meta-learning this similarity is computed directly in the dataset
space; it is not learned, and most importantly it does not try
to model the relative workflow preference vector [11], [25].
In our experimental setting the strategy that implements this
traditional meta-learning approach is the Euclidean
distance-based dataset similarity, EC. In addition to the homogeneous
metric learning approach we can also use the two heterogeneous
metric learning variants to provide the workflow
preferences. The simplest one, corresponding to the
objective function F3, optimization problem (5), uses both dataset and
workflow characteristics and tries to directly approximate the
relative preference matrix. However, this approach ignores
the fact that the learned metric should reflect two basic
meta-mining requirements: that similar datasets should have
similar workflow preferences, and that similar workflows
should have similar dataset preferences. The objective
function F4, optimization problem (7), reflects exactly this
bias by appropriately regularizing the learned metrics in
the dataset and workflow spaces so that they reflect well
the similarities of the respective preference vectors. Before
discussing the actual results, given in the left part of
Table II, we give the parameter settings for the different
variants. F1: μ1 = 0.5, N = 5; F3: μ1 = μ2 = 0.5; F4:
α = 1e-10, β = 1e-3, γ = 1e-[...], μ1 = [...], μ2 = 0. These
parameters reflect what we think are appropriate choices
based on our prior knowledge of the meta-mining problem.
Better results would have been obtained if we had tuned at
least some of them via inner cross-validation.
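
To make the contrast between EC and a learned homogeneous metric concrete, the sketch below predicts a workflow preference vector for a new dataset from its nearest training datasets. It rests on several assumptions not stated in the text: predictions are formed by averaging the preference vectors of the k nearest neighbours, the learned metric is represented by a linear map L, and L is set by hand here rather than obtained by optimizing F1. All names and data are hypothetical.

```python
def sq_distance(x, y, L=None):
    """Squared distance ||L x - L y||^2; L=None gives the Euclidean case (EC)."""
    diff = [a - b for a, b in zip(x, y)]
    if L is not None:
        # Project the difference vector with the linear map L (rows of L).
        diff = [sum(w * d for w, d in zip(row, diff)) for row in L]
    return sum(d * d for d in diff)

def predict_preferences(x_new, X_train, R_train, k=5, L=None):
    """Average the workflow preference vectors of the k nearest training datasets."""
    nearest = sorted(range(len(X_train)),
                     key=lambda i: sq_distance(x_new, X_train[i], L))[:k]
    m = len(R_train[0])
    return [sum(R_train[i][a] for i in nearest) / len(nearest) for a in range(m)]

# Toy data: the first descriptor is informative, the second is large-scale noise.
X_train = [[0.0, 10.0], [0.1, -10.0], [1.0, 0.0], [1.1, 0.5]]
R_train = [[2, 1, 0], [2, 1, 0], [0, 1, 2], [0, 1, 2]]
x_new = [0.05, 0.2]

# Hand-set map that keeps only the first descriptor (stands in for a learned one).
L = [[1.0, 0.0]]

ec_prediction = predict_preferences(x_new, X_train, R_train, k=2)
metric_prediction = predict_preferences(x_new, X_train, R_train, k=2, L=L)
print(ec_prediction)
print(metric_prediction)
```

In this toy setting the Euclidean neighbours are driven by the noisy descriptor, while the projected distance recovers the datasets whose preferences actually match.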

Looking at the actual results we see right away that the
approach that makes use of only the dataset characteristics,
F1, has a performance that is not statistically significantly
different from either the default or the EC baseline
with respect to the Spearman's rank correlation coefficient,
ρ, and the average accuracy of the top five workflows
it suggests, t5p. In addition it is statistically significantly
worse than EC with respect to the mean average error
criterion, mae, having a lower mae value than EC on only
20 out of the 65 datasets. Looking at the performance of
F3, the heterogeneous metric that tries to directly approximate
the preference matrix R, we see that its results are quite
disappointing. It is significantly worse than the default strategy
and the EC baseline for almost all performance measures.
So trying to learn a heterogeneous metric that relies
exclusively on the approximation of the preference matrix
is definitely not an option. However, when we turn to the
F4 objective function, which learns the heterogeneous metric
in such a manner that it reflects not only the preference
matrix but also the fact that similar datasets should have
similar workflow preferences and vice versa, we see
that the performance is excellent. F4 beats in a
statistically significant manner both the default strategy and
the EC baseline in almost all cases; the only exception is
the Spearman's correlation coefficient comparison with the
default, where the p-value, p = 0.082,
does not fall below the significance threshold of 0.05.
Overall, in such a recommendation scenario the best strategy
consists in learning a combination of two homogeneous
metrics and one heterogeneous metric that reflect the similarities
of the datasets with respect to the workflow preferences,
the similarities of the workflows with respect to the dataset
preference vectors, and the similarities of workflow-dataset
pairs according to the preference matrix.
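
The X/Y counts and p-values quoted for these comparisons can be examined with a sign-test style computation. The sketch below assumes a two-sided exact binomial test against a success probability of 0.5; the text only says "binomial test", so this particular variant is an assumption, and the function name is hypothetical.

```python
from math import comb

def binomial_two_sided(wins, n):
    """Exact two-sided binomial test of `wins` out of `n` against p = 0.5.

    For p = 0.5 the distribution is symmetric, so the two-sided p-value is
    twice the smaller tail probability, capped at 1.
    """
    lower = sum(comb(n, i) for i in range(0, wins + 1))   # P(X <= wins) * 2^n
    upper = sum(comb(n, i) for i in range(wins, n + 1))   # P(X >= wins) * 2^n
    return min(1.0, 2.0 * min(lower, upper) / 2 ** n)

# Win counts taken from the workflow-preference comparisons over 65 datasets.
for wins in (20, 32, 40, 42):
    print(f"{wins}/65  p={binomial_two_sided(wins, 65):.3f}")
```

Under this assumption, 40 wins out of 65 stays above the 0.05 threshold while 42 out of 65 falls below it, which matches the pattern discussed for F4.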

Learning Dataset Preferences: The goal of this
meta-mining task is, given a new workflow and a collection of
datasets, to provide a dataset preference vector that
reflects the order of appropriateness of the datasets for the
given workflow. As already mentioned, the default strategy
provides here a vector of equal ranks, thus we cannot
compute its Spearman's rank correlation coefficient. We will
compare the performance of the F2 objective function, which
makes use of only the workflow descriptors when it
tries to approximate the similarity of the dataset preference
vectors, with that of F3 and F4. We used the following
parameters: F2: μ1 = 10, N = 5; F3: μ1 = μ2 = 10; F4:
α = 1e-10, β = 1e-3, γ = 1e-[...], μ1 = 0.5, μ2 = 0.
Looking at the results, middle part of Table II, we see
that when it comes to the mean average error, all methods
achieve a performance that is statistically significantly better
than that of the default strategy, suggesting that this
meta-mining task is probably easier than the first one. This makes
sense since it is easier to describe workflow similarity in
terms of the concepts that these workflows use than
to describe dataset similarity in terms of the datasets'
characteristics. Neither F3 nor F4 has a mae performance
that is statistically significantly better than the Euclidean
baseline. Nevertheless F4 is statistically significantly better
than the Euclidean baseline when it comes to the Spearman's rank
correlation coefficient. Thus for this meta-mining task there
is also evidence that we should take a more global approach
by accounting for all the different constraints on the dataset
and workflow metrics, as F4 does.

Learning Dataset-Workflow Preferences: The last
meta-mining task is by far the most difficult one. Here
we want to predict the appropriateness of a new workflow
for a new dataset, i.e. the r_{x,a} value. The only metric
functions that are applicable here are F3 and F4, since
these are the only ones that are heterogeneous, i.e. they
can compute a similarity between a dataset and a workflow.
Note also that the Euclidean baseline strategy is no longer
applicable because it can only be used between objects
of the same type. When it comes to the mean average error,
F3 has a very poor performance compared to the default
strategy. F4 has a considerably better performance than F3,
thus providing further support for the incorporation of the
additional constraints in the objective function; nevertheless
this performance is still significantly worse than
that of the default strategy.

Overall we tested a number of new metric-learning-based
algorithms to solve different variants of the meta-mining
problem following a hybrid recommendation approach. We
have two metric-learning flavors, the homogeneous
and the heterogeneous. In the homogeneous flavor we learn
a metric in the original space in which some objects are
described, here datasets or workflows, which tries to
approximate a similarity defined over a different space, that of the
relative preference vectors. In the heterogeneous flavor
we learn a metric over the two different spaces that tries to
reflect directly the goodness of match between the different
objects. As it turns out, the best approach comes from the
appropriate regularization of the heterogeneous metric by
exploiting the additional constraints imposed on each of the
original object spaces. In other words we seek a
heterogeneous metric defined over a common projection space
of datasets and workflows, where the projection matrices of
the datasets and workflows are constrained to reflect
preference vector similarities. In the immediate future we want to
evaluate the performance of the approach we presented here
on recommendation problems other than meta-mining
with similar problem requirements.

VI. Related Work 

The meta-mining problem formulation we gave here
is closely related to work on hybrid recommender
systems [6], [7], [8], [9]. There the goal is to accurately
recommend items to users using information on historical
user preferences and descriptors of items and users.
Examples of user and item descriptors can be
found for instance in the MovieLens dataset, where we
have demographic or activity information on users, such as
age, gender and occupation, and taxonomic information on
movies, such as genre and release date.



State-of-the-art recommender methods [6], [7], [8] rely
on matrix factorization methods to directly approximate
the preference matrix, as we do in optimization
problem (5). One such method takes a Bayesian approach
in which a probabilistic bi-linear rating model is inferred by
a combination of expectation propagation and variational
message passing. User and item features are modelled with
Gaussian priors in two matrices U and V of latent traits,
the inner product of which defines user-item similarities.
Variational approximation on users and items is then used
to regularize the latent factors. The reported experiments on the
MovieLens dataset showed that including user and item
descriptors improves performance. [6], [7] also propose
a generative probabilistic model, where the model fitting
is done by a Monte Carlo EM algorithm with no variational
approximations. They regularize their model using a
regression-based approach on user and item factors, where
the latter are determined using topic modelling [6]. They
also experimented on the MovieLens dataset and showed
that a model based on the meta-data had a weak predictive
performance, while their regression-based approach to
the latent factors regularization gave the best performance
improvements. Our metric-learning-based approach to the
problem of hybrid recommendation uses a very different
regularization approach in learning the factorization
matrices: we focus on constraining them so that they
reflect in the original feature spaces the similarities of
the respective preference vectors, an approach which in our
meta-mining experiments had the best performance. One
additional advantage of the two linear projection
matrices learned in the dataset and workflow spaces is that
we can naturally handle the out-of-sample problem, i.e.
the cold-start problem in recommender systems, which is not
the case with typical matrix factorization models.
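
The way linear projections sidestep the cold-start problem can be sketched as follows. The sketch assumes that the heterogeneous similarity takes the bilinear form x^T U V^T a, where U and V project dataset and workflow descriptors into a common space; this form, the function names, and the hand-set matrices below are illustrative assumptions, whereas in the approach described here the matrices would be learned from the preference matrix.

```python
def project(vec, P):
    """Map a descriptor vector into the common space: returns P^T vec."""
    return [sum(vec[i] * P[i][j] for i in range(len(vec)))
            for j in range(len(P[0]))]

def match_score(x, a, U, V):
    """Assumed heterogeneous similarity x^T U V^T a between dataset x and workflow a."""
    zx, za = project(x, U), project(a, V)
    return sum(p * q for p, q in zip(zx, za))

# Hand-set projection matrices (stand-ins for learned ones).
U = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 dataset descriptors -> 2 dims
V = [[1.0, 0.0], [0.0, 1.0]]               # 2 workflow descriptors -> 2 dims

# Workflows described only by their descriptors.
workflows = {"a1": [1.0, 0.0], "a2": [0.0, 1.0], "a3": [1.0, 1.0]}

# A dataset never seen during training: only its descriptors are needed.
x_new = [1.0, 0.0, 2.0]

scores = {name: match_score(x_new, a, U, V) for name, a in workflows.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Because scoring depends only on descriptors and the two matrices, no row of the preference matrix is required for the new dataset, unlike a factorization that stores one latent vector per known dataset.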

VII. Conclusion and Future Work 

In this paper we take a new view on the relatively
new concept of meta-mining, a view that is also relevant
for the more traditional work on meta-learning. We model
the problem of selecting the appropriate workflow
or algorithm for a dataset as a hybrid recommendation
problem, in which suggestions are provided based on
the descriptors of the dataset and the workflow or algorithm.
To that end we propose a new metric-learning-based
approach to the hybrid recommendation problem, which learns
homogeneous metrics in the original dataset and workflow
spaces, constrained in a manner that reflects workflow
preference and dataset preference vector similarities, and
combines them with a heterogeneous metric in the
dataset-workflow space that reflects the appropriateness of a given
workflow for a given dataset. The two homogeneous
metric-learning problems act as additional, relevant regularizers
for the heterogeneous metric learning problem. In addition,
thanks to the linear projections that lie at the core of our
method, it is able to handle the cold-start
problem in a natural manner. The combined use of the three metrics achieves
the best results. To the best of our knowledge this is the first
approach of its kind, not only for the meta-mining problem
but also for the more general problem of hybrid
recommendation. Our immediate goal is to experiment with
our approach on standard hybrid recommendation problems,
such as the MovieLens dataset, and compare its performance
with typical recommendation approaches used in such
problems.